Abstract. This paper describes a mathematical
programming approach to finding an optimal,
heterogeneous suite of processors to solve
supercomputing problems. This technique, called
superconcurrency, works best when the computational
requirements are diverse and significant portions of the
code are not tightly-coupled. It is also dependent on new
methods of benchmarking and code profiling, as well as
eventual use of AI techniques for intelligent management
of the selected superconcurrent suite.

Keywords. Superconcurrency, Supercomputing,
Code Profiling, Benchmarking, Optimal Selection,
Amdahl's Law

1.  Objective 

Ercegovac [1988] has recently looked at the
feasibility of a suite of heterogeneous processors to solve
supercomputing problems. Resnikoff [1987] and Kamen
[1989] have examined the cost-effectiveness of
supercomputers (one generally finds the smaller
mini-supers to be more cost-effective than the largest-sized
processors). Bokhari [1988] has investigated partitioning
problems among various types of processors. There are
several reasons for partitioning. First, many large codes
have diverse computational types. Second, the various
super-speed parallel and vector processors have quite
different performance profiles on these types, with
differences often amounting to several orders of magnitude.
It is a commonplace observation and a corollary of Amdahl's
Law [1967] that any single type of supercomputer often
spends most of its time computing code types for which
it is poorly designed. If we could configure our processor
suite so that each processor could spend almost all its
time on code for which it is well designed, the overall
increase in speed could be orders of magnitude over what
is now achieved by conventional supercomputing.

Superconcurrency is a general technique for
matching and managing optimally configured suites of
super-speed processors. In particular, this paper shows a
general method for choosing the most powerful suite of
heterogeneous parallel and vector supercomputers for a
given problem set, subject to a fixed constraint, such as
cost. The dual problem could find the minimal cost
configuration for a fixed speed requirement. Thus the
Optimal Selection Theory is a mathematical program for
which one wishes to minimize the total time spent on
the sum of all code subsegments. The method is
mathematically dependent on a new methodology of code
profiling of the problem sets being implemented and a
new methodology of analytical benchmarking. The
formulation also rests on two mathematical assumptions
outlined below. The intent is to use this technique to
provide supercomputing power for Naval Command and
Control (C2) problems; however, this paradigm should
work for many classes of supercomputing problems. The
basic result is that for a computational problem that has a
diverse set of computational types, not all tightly-coupled,
the optimal solution is a heterogeneous suite of
parallel and vector processors rather than a single
supercomputing architecture. This solution is called
superconcurrency both because it is an approach to
supercomputing and because it concurrently uses
concurrent (vector and parallel) processors.

2.  Mathematical  Formulation 

Let us state the basic problem as a linear (actually
integer) program. We want to get the most power we
can, given some overall cost constraint. More
mathematically, we wish to maximize the power (or
speed) function, $P$. We do this by minimizing a time
function, $T$, giving the time taken on a code, so that
$P = T^{-1}$. $T$ is defined on the two-variable range
$X \times S$. $X$ is the set of potential machine choices,
$X = \{x_i\}$, where the $x_i$ are candidate architectures.
$S$ is a non-overlapping set of all code subsegments, $S_j$;
thus $S = \bigcup_j S_j$ and $S_j \cap S_k = \emptyset$ if
$j \ne k$. The choice of the $S_j$ defines the code profiling
and analytical benchmarking problem. We will shortly look at
motivation and explanation of this point. We denote $C$
as the overall cost constraint, $\{c_i\}$ as the set of costs
corresponding to the $\{x_i\}$, and $\{t_i\}$ as the set of
corresponding time functions, i.e., $t_i(S_j)$ is the time
taken by machine $x_i$ on code segment $S_j$. Let $I$
denote the set of all possible indices of one machine type
per segment, with $v_j$ denoting the number of such
machines used per segment. Let $v_i$ be the number of
machines of type $i$ (which may be 0 if machine $x_i$ is not
in the indexed configuration). Then the mathematical
programming problem can now be stated as:

(1)  MINIMIZE

$$T(X, S) = \sum_{(i,j) \in I} t_i(S_j) / v_j$$

such that $\sum_i v_i c_i \le C$

Because of assumptions we will be able to make
later about linearity, we will want to group together all
code subsegments of the same type, e.g., vectorizable
code subsegments, or ones susceptible to parallelism.
Thus we will divide $S$ into equivalence classes, $G_j$, by
code type. For convenience we will index these classes
by the first subsegment of that type, e.g.,

$$G_j = \{ S_i \mid S_i \equiv S_j \}$$

where $\equiv$ denotes having the same code type. We will
later use the notation $t_i(G_k)$ to denote
$t_i(G_k) = \sum t_i(S_j)$ for all $S_j \in G_k$.

3.  Vector-only  Example 

Let us consider first an intuitive example using
primarily vectorizable code. Typically some portion of
the code will be decomposable and some non-decomposable.
By decomposable vector code we mean
vectorizable code for which the original problem can be
broken up into some small number of independent
vectorizable subcodes. Let us consider a specific code
example. Suppose it to be 50% decomposable
vectorizable and 50% non-decomposable vectorizable.
Suppose further that the overall cost constraint is $2M
and that we have a choice of three machines. Machine x1
costs $2M and speeds up vectorizable code (over some
baseline scalar system) by a factor of 5. Machine x2
costs $1M and speeds up vectorizable code by a factor
of 4. Finally machine x3 costs $1/3M and speeds up
vectorizable code by 3. We can enumerate the possible
solutions that satisfy the cost constraint, namely:


Solution    Configuration    P

1           1 x1             5.00
2           2 x2             5.33
3           1 x2 + 3 x3      6.09
4           6 x3             5.14

Solution 1 clearly gives an overall speed-up factor
(on the total vectorizable code) of P = 5. Solution 3
gives a speed-up factor of 4 on the non-decomposable
vector portion (on which we use machine x2) and a
speed-up of a little better than 12 (12.8) on the
decomposable portion. The time relative to the original
scalar baseline is 1/2 * 1/4 for the half of the vector code
that is non-decomposable. Assuming the decomposable
code is distributed evenly over all four machines, the time
for the other half of the code is 1/2 * 1/4 * (.75 * 1/3 +
.25 * 1/4). Summing these two yields a total time
relative to the original scalar baseline time of 1/8 +
5/128 or 21/128. Thus it has an overall speed-up factor
of P = 6.09 (128/21). Similarly we can compute that
for solution 2, P = 5.33 (16/3) because the total time
relative to the original scalar baseline is 1/2 * 1/4 + 1/2
* (1/2 * 1/4) = 3/16. For solution 4 the relative total
time is 1/2 * 1/3 + 1/2 * (1/6 * 1/3) so P = 5.14
(36/7). Thus solution 3 is optimal.
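
The enumeration in this example can be checked mechanically. The following is a minimal Python sketch, not part of the original method: the names `relative_time` and `feasible_suites` are hypothetical, and the rule used for decomposable code (an even split over all machines in the suite, timed as 1/n times the mean of the inverse speed-ups) is an assumption inferred from the worked arithmetic for this example.

```python
from fractions import Fraction
from itertools import product

# Machine data from the vector-only example: (cost in $M, vector speed-up).
MACHINES = {
    "x1": (Fraction(2), Fraction(5)),
    "x2": (Fraction(1), Fraction(4)),
    "x3": (Fraction(1, 3), Fraction(3)),
}
BUDGET = Fraction(2)  # overall cost constraint in $M


def relative_time(counts, nondecomp=Fraction(1, 2), decomp=Fraction(1, 2)):
    """Time relative to the scalar baseline for a suite given as {name: count}."""
    speeds = [MACHINES[m][1] for m, n in counts.items() for _ in range(n)]
    n = len(speeds)
    # Non-decomposable code runs on the single fastest machine in the suite.
    t_nondecomp = nondecomp / max(speeds)
    # Assumption: decomposable code is split evenly over all n machines and
    # timed as (1/n) * mean(1/speed), matching the arithmetic in the text.
    t_decomp = decomp * sum(1 / s for s in speeds) / n / n
    return t_nondecomp + t_decomp


def feasible_suites():
    """Yield every non-empty suite whose total cost is within BUDGET."""
    names = list(MACHINES)
    limits = [range(int(BUDGET // MACHINES[m][0]) + 1) for m in names]
    for combo in product(*limits):
        cost = sum(n * MACHINES[m][0] for m, n in zip(names, combo))
        if sum(combo) > 0 and cost <= BUDGET:
            yield dict(zip(names, combo))


best = min(feasible_suites(), key=relative_time)
best_speedup = 1 / relative_time(best)
print(best, float(best_speedup))
```

Under these assumptions the search covers every suite within the $2M constraint, including those that do not exhaust the budget, and it selects one x2 plus three x3 with P = 128/21.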

4.  Underlying  Assumptions 

One of the fundamental performance limiting
factors of vector or parallel supercomputers has been the
performance characteristics of such architectures on codes
and algorithms for which they are not well designed.
This is because such machines have generally been used
on entire codes, and large codes usually have significant
portions which are not vectorizable or portions which are
not parallel. The underlying motivation for this paper is
the desire to understand how to map the various atomic
code portions to the types of architectures for which they
are best suited. Three of the resulting needs that follow
from this are a distributed intelligent network system
(DINS) to control and schedule such code/architecture
matching, and new kinds of analytical benchmarking and
code profiling (briefly discussed below).

There are two fundamental mathematical
assumptions used in this paper. One is the linearity of
each machine's characteristic time function, $t_i$, i.e.,
$t_i(S_j)$ is assumed to be linear in code segment length
$|S_j|$ (or implied length due to, say, DO loops).
Actually this is normally an invalid assumption when
considering the relationship of machine behavior across
code types! However we will be making this assumption
only when considering the effect of any machine type
against optimally matched code. This is equivalent to
saying that each type of architecture acts linearly
(asymptotically) on that type of code for which it is best
suited, e.g., a vector machine takes twice as long to
process a vector of length 1000 as of length 500. It is
precisely because of this assumption of linearity that we
can legitimately refer to $t_i(G_j)$ as well as $t_i(S_j)$.
Similarly we will assume the decrease in time due to
multiple copies of the same machine on optimally
matched and decomposable code to be linear.

5.  Mathematical  Reformulation 

Since we are supposing all the time functions, $t_i$,
to be linear, we can write each such function as:

$$t_i(G_j) = \alpha_i t_0(G_j) + \beta_i$$

where the $\beta_i$ are overhead factors, the $\alpha_i$ are
speed-up factors (thus the $\alpha_i$ are such that typically
$0 < \alpha_i \ll 1$), and $t_0$ represents some baseline time
function such as that derived from, say, a VAX. Since we are
primarily interested in asymptotic results, we will ignore the
$\beta_i$ and simplify the expression to be:

$$t_i(G_j) = \alpha_i t_0(G_j)$$

Now for any given code, $S$, let it (or at least its "hot
spots") be divided into non-overlapping subsegment sets,
$G_j$. Let $\pi_j$ be the percentage of time spent by the
baseline machine on code subsegment set $G_j$, i.e.,
$t_0(G_j) = \pi_j$. Thus we have the baseline percentages
given by:

$$T_0(S) = t_0\Big(\bigcup_j G_j\Big) = \sum_j t_0(G_j)$$


Finally we need to deal with the speed-ups given in the
decomposable cases (vector decomposable or all parallel
types). As mentioned above, we assume that $v$
processors will give a linear speed-up of $v$, i.e., for
$v_j$ processors with characteristic time function $t_i$, we
will have total time given by $t_i / v_j$. It will be
understood that $v_j$ is 1 for non-decomposable cases.
This gives us a reformulation of the original
mathematical program as:

(2)  MINIMIZE

$$T(X, G) = \sum_{(i,j) \in I} t_i(G_j) / v_j = \sum_{(i,j) \in I} \alpha_i \pi_j / v_j$$

such that $\sum_i v_i c_i \le C$

(where $v_i$ may be 0 if machine $x_i$ is not in the optimal
configuration)
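
As a rough illustration of how program (2) might be evaluated for one candidate configuration, the sketch below sums alpha_i * pi_j / v_j over the (machine, code class) pairs in use and checks the cost constraint. The data layout and the names `objective` and `total_cost` are hypothetical. The numbers are those of solution 2 of the vector-only example: two machines costing $1M each, speed-up 4 (so alpha = 1/4), with the two code classes each taking half of the baseline time.

```python
from fractions import Fraction


def objective(assignment):
    """Sum alpha_i * pi_j / v_j over the (machine, code class) pairs in use."""
    return sum(alpha * pi / v for alpha, pi, v in assignment)


def total_cost(counts, costs):
    """Sum v_i * c_i over the machine types in the suite."""
    return sum(counts[m] * costs[m] for m in counts)


# Each entry is (alpha_i, pi_j, v_j).
assignment = [
    (Fraction(1, 4), Fraction(1, 2), 1),  # non-decomposable class, v_j = 1
    (Fraction(1, 4), Fraction(1, 2), 2),  # decomposable class on two copies
]
counts = {"x2": 2}           # v_i: number of machines of each type
costs = {"x2": Fraction(1)}  # c_i in $M
C = Fraction(2)              # overall cost constraint in $M

T = objective(assignment)
feasible = total_cost(counts, costs) <= C
print(T, 1 / T, feasible)
```

Exact fractions are used so that the result can be compared directly with the hand arithmetic (3/16, hence P = 16/3).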

6.  Multi-type  Example

Let us consider a more complicated example than
earlier. In this we shall have different types of vector,
scalar, and parallel subcode. We will suppose we have
code that is 50% vectorizable (35% non-decomposable
and 15% decomposable), 20% fine-grain parallel, 20%
coarse-grain parallel, and 10% scalar. For each of the
possible machines below, we shall give speed-up factors
on code for which they are designed, as well as scalar
speed-up (over some baseline), in case the machine is
used for scalar code. We shall also assume that each type
of machine only achieves scalar speed on code for which
it is not designed, e.g., a vector machine will be assumed
to get only scalar speed on parallel code. Suppose our
overall cost constraint is $4M. Let our vector machines
be: x1 with cost $4M, vector speed-up of 10, and scalar
speed-up of 2; x2 with cost $1M, vector speed-up of 5,
and scalar speed-up of 2; and x3 with cost $1/3M, vector
speed-up of 3, and scalar speed-up of 1. Let our
fine-grain parallel machines be: x4 with cost $1M and parallel
speed-up of 15; and x5 with cost $1/3M, parallel speed-up of
4, and scalar speed-up of 1. Let our coarse-grain parallel
machines be: x6 with cost $1M, parallel speed-up of 4, and
scalar speed-up of 1; and x7 with cost $1/3M, parallel
speed-up of 2, and scalar speed-up of 1. Finally let x8 be
a scalar machine with cost $1/4M and speed-up of 2.

One solution would be to spend $4M on the
highest performing vector machine, x1, and in fact this is
the traditional supercomputing solution. This solution
gives a speed-up factor of 10 on the vector portion and a
speed-up of 2 on the rest. Thus its total time relative to
the original scalar baseline is 1/20 + 1/4 = 6/20. Thus
it has an overall speed-up factor of P = 3.33. This
solution is not nearly optimal! Consider a solution of
one x2, three x3, one x4, and one x6. This solution
has overall speed-up of 5 on the non-decomposable
vector, better than 12 on the decomposable, 15 on the
fine-grain, 4 on the coarse-grain, and 2 on the scalar.
Thus its total time relative to the original scalar baseline
is .35/5 + .15 * 1/4 * (.25 * 1/5 + .75 * 1/3) + .2/15 +
.2/4 + .1/2. Thus it has an overall speed-up factor of P =
5.139. Furthermore this does not even consider the
secondary advantage of this multi-machine solution that
the machines not being used on this code at any one time
are available for other work. This example is, we
believe, representative of a wide class of supercomputing
problems in which the best total speed-up comes from a
multi-machine solution in which no one machine is a
traditional supercomputer.
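
The two totals in this example can be reproduced with exact fractions. This is only a sketch of the arithmetic as stated in the text; the function names `single_x1` and `mixed_suite` and the dictionary keys are hypothetical.

```python
from fractions import Fraction as F

# Baseline time fractions by code type, as given in the example.
PROFILE = {
    "vector_nondecomp": F(35, 100),
    "vector_decomp": F(15, 100),
    "fine_grain": F(20, 100),
    "coarse_grain": F(20, 100),
    "scalar": F(10, 100),
}


def single_x1():
    """One x1: speed-up 10 on all vector code, scalar speed-up 2 on the rest."""
    vector = PROFILE["vector_nondecomp"] + PROFILE["vector_decomp"]
    rest = 1 - vector
    return vector / 10 + rest / 2


def mixed_suite():
    """One x2, three x3, one x4, one x6, following the arithmetic in the text."""
    t = PROFILE["vector_nondecomp"] / 5  # x2, vector speed-up 5
    # Decomposable vector code spread evenly over one x2 and three x3.
    t += PROFILE["vector_decomp"] * F(1, 4) * (F(1, 4) / 5 + F(3, 4) / 3)
    t += PROFILE["fine_grain"] / 15      # x4
    t += PROFILE["coarse_grain"] / 4     # x6
    t += PROFILE["scalar"] / 2           # x2, scalar speed-up 2
    return t


p_single = 1 / single_x1()
p_mixed = 1 / mixed_suite()
suite_cost = 1 + 3 * F(1, 3) + 1 + 1     # $M for x2, three x3, x4, x6
print(float(p_single), float(p_mixed), suite_cost)
```

The exact values are 10/3 for the single x1 and 2400/467 for the mixed suite, which round to the 3.33 and 5.139 quoted above, and the mixed suite costs exactly $4M.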

7.  Mixed  Strategies 

There is an extension of the mathematical
program, (2), above when we want a mixed strategy of
optimizing several project applications with varying
priorities. Suppose there are $Q$ projects, $p_k$
($k = 1, \ldots, Q$), with relative weightings $\lambda_k$ and
$\sum_k \lambda_k = 1$.

Where several application projects need to be
accommodated by the optimum choice of a mix of
processors, we may weight the times spent on the code
profile types in each project by the importance of that
project and use the sums of these derivative times (again
suitably calibrated to add up to 1) to determine the
relative need for processors to handle different code profile
types. For example, suppose there are just two projects,
a high-priority one with weighting $\lambda_1 = 0.7$, and a
low-priority one, with $\lambda_2 = 0.3$. If a particular profile
type appears in just the high-priority project, the percentage
of time that it takes up in the high-priority project can be
weighted by 0.7, whereas if it appears also in the
low-priority project, the percentage of time that profile type
takes in the low-priority project weighted by 0.3 would
be added to the original weighted value to give the overall
value to contribute to minimizing time under the cost
constraints.

Suppose there are $M$ profile types, $m_k$
($k = 1, \ldots, M$), with relative time requirements that are
calibrated to sum to 1 for each project. Then we have an
$M \times Q$ matrix to express the distribution of time
needed to handle the code profile types for the projects and
a $Q \times 1$ matrix of importance values associated with the
various projects, which can be multiplied to produce an
$M \times 1$ matrix of weighted times for the different code
profiles. These weighted times can then be used, once
scaled to add to 1, as the $\pi_j$ for the code profiles in (2)
above.

If the priorities of the different projects are used in
this way, the original assignment of the priorities to
projects will need to take into account the quantity of
code for the project, since the process of scaling of the
times needed for code of different types is done by project,
and distinctions of code volume among the projects are
thereby erased. If there are $Q$ projects each with some
proportion of the total importance value of 1, we have for
the $j$th code profile type:

$$\sigma_j = \sum_k \lambda_k \pi_{j,k}$$

where $\pi_{j,k}$ is the fraction of project $k$'s time spent
in profile type $j$. The new $\sigma_j$'s can be scaled so that
they add up to 1 to produce new $\pi_j$'s to replace the old
$\pi_j$'s in equation (2). This revision will permit the
evaluation of the optimum mix of processors to be sensitive
to priorities favoring some projects. Note that "project" can
be interpreted as "project application" if it is felt that within
a project some applications have different priorities than
others. With this interpretation, $Q$ will just be a larger
number.
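
The weighting step amounts to a matrix-vector product followed by a rescaling. The sketch below assumes the weightings 0.7 and 0.3 from the two-project example; the 3 x 2 profile matrix and the name `weighted_profile` are hypothetical and serve only as illustration.

```python
from fractions import Fraction as F


def weighted_profile(pi, lam):
    """Return sigma_j = sum_k lam[k] * pi[j][k], rescaled to sum to 1."""
    sigma = [sum(w * p for w, p in zip(lam, row)) for row in pi]
    total = sum(sigma)
    return [s / total for s in sigma]


# Project weightings from the two-project example in the text.
lam = [F(7, 10), F(3, 10)]

# Hypothetical M x Q matrix: rows are profile types, columns are projects.
# Each column sums to 1, i.e. times are calibrated per project.
pi = [
    [F(1, 2), F(1, 5)],  # vectorizable
    [F(0), F(3, 5)],     # parallel
    [F(1, 2), F(1, 5)],  # scalar
]

new_pi = weighted_profile(pi, lam)
print([float(x) for x in new_pi])
```

When every column of the matrix and the weightings already sum to 1, the weighted times also sum to 1 and the rescaling leaves them unchanged; it matters only when the inputs are not calibrated in that way.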

8.  Code  Types  and  Benchmarking 

Benchmarking and Code Profiling - As discussed
earlier, the basic approach of this paper is contingent
upon breaking down the overall code into groups of
segments within which the processing requirements are
the same or homogeneous. The segments of
homogeneous type are assigned to optimal processors for
that type. Before that can be done, it is necessary to take
two benchmarking type steps. The first, called code-type
profiling, is a code specific function to identify the
"natural" types of code that are actually present and group
the code segments by type. Types that might be
identified include vectorizable decomposable, vectorizable
non-decomposable, fine/coarse-grain parallel,
SIMD/MIMD parallel, scalar, special purpose (e.g., FFT
or specialized sort algorithm), etc. The second step, called
analytical benchmarking, is an analysis of how the
available processors perform on the identified types, i.e.,
this identifies processors that are appropriate solutions for
each code type. Thus it is more analytical than some
previous techniques that simply looked at the overall
result of running a processor on an entire benchmark code
or set of loops (without any real analysis of how the
myriad of relevant factors contributed). However it
should be pointed out that recent research by Dongarra
[1989] on LINPACK and Murphy [1989] on Livermore
Loops provides some insight to the processes involved.
Both code profiling and analytical benchmarking are now
being undertaken by the Superconcurrency Research Team
(SRT) at the Naval Ocean Systems Center (NOSC). Our
initial research at Profiling/Benchmarking was directed at
several large Naval C2 problems and a suite of
potentially matching mini-supers/parallel processors
(including small Connection Machine, DAP, Ardent,
Encore, Butterfly, and Convex). Most of the C2
applications we have looked at so far have been relatively
loosely-coupled and we have found it feasible to break
them up (manually) into homogeneous portions and
assign them to appropriate processors. From the
processor (benchmarking) point of view, our most
interesting result to date is how consistently the long
vector problems are much better done on SIMD
(Connection Machine or DAP) processors rather than
vector processors.

Bandwidth and Mixed Types - Tightly and
medium-coupled portions of code will be more difficult to
break up and assign to different processors, and the ability
to do this will rest in part on the bandwidths of the
storage devices and distributed network used. In these
cases, it may be necessary to assign mixed type code to
the best processor available. This can always be done
optimally with a superconcurrent approach but on an ad
hoc basis with reduced theoretical value.

Distributed Intelligent Network System (DINS) -
One of the most active current research areas of the SRT
is DINS. DINS will be a reasoning system that uses
information from Code Profiling, Analytical
Benchmarking, and network bandwidth to optimally
assign portions of code to appropriate processors. In a
general sense, this is similar to current research in load
balancing and priority assignment. However the
information to be used will be the three sources
mentioned above, with the primary aim of optimally
matching code portions to processors rather than (the
secondary) factors of load balancing and priority
assignment. Since DINS will reason about processors
actually available to it, this means that we can achieve
configuration control at different sites even though there
may be a different superconcurrent suite at each.
Similarly DINS will continue to function and assign a
second best processor if a first choice is unavailable or
down. Thus DINS is robust and survivable. Likewise it
is compatible with evolutionary development; when a
new processor is introduced because of changing
technology, we simply replace the old benchmarking data
with the new. The features of robustness, configuration
control, survivability, tailorability, and evolutionary
development are essential for Naval C2 problems.
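
The fallback behaviour described for DINS (assign the second best processor when the first choice is unavailable) can be pictured with a toy lookup. This is not the DINS design, which the paper leaves to future work; `choose_processor` and the table layout are hypothetical, and the speed-up figures are borrowed from the multi-type example.

```python
def choose_processor(code_type, benchmarks, available):
    """Pick the available processor with the highest benchmarked speed-up."""
    candidates = {
        proc: speedup
        for proc, speedup in benchmarks[code_type].items()
        if proc in available
    }
    if not candidates:
        return None  # no benchmarked processor for this code type is up
    return max(candidates, key=candidates.get)


# Analytical benchmarking data: speed-up by code type and processor.
benchmarks = {
    "vector": {"x1": 10, "x2": 5, "x3": 3},
    "coarse_grain": {"x6": 4, "x7": 2},
}

first_choice = choose_processor("vector", benchmarks, {"x1", "x2", "x3"})
fallback = choose_processor("vector", benchmarks, {"x2", "x3"})  # x1 is down
print(first_choice, fallback)
```

Introducing a new processor then only requires adding its entry to the benchmarking table, which mirrors the evolutionary development property claimed above.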

9.  Superconcurrency 

The underlying premise of this paper is that many
codes, and particularly many sets of codes, have a
heterogeneous set of computational types. The solution,
called superconcurrency, is nothing more than the
commonsensical approach of selecting a heterogeneous
suite of processors that most effectively addresses this
diverse set of requirements. The solution is expressed as
a mathematical program with all that implies about the
existence of an optimal solution. This approach requires a
more analytical way of benchmarking and code profiling
in order to analyze the power of various processors on
atomic portions of code. Superconcurrency has the
potential of achieving orders of magnitude greater speed
over conventional supercomputers if the code profiling
techniques show the overall application to be quite
diverse in its requirements. The future addition of a
Distributed Intelligent Network System to manage a
superconcurrent suite of vector and parallel processors
offers the potential of robustness, configuration control,
survivability, tailorability, and evolutionary
development.